Introduction

This file contains essential commands from the chapters of r4ds and corresponding examples. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.

All ds4psy essentials:

Nr. Topic
1. Creating and using tibbles
2. Data transformation
3. Visualizing data

Course coordinates

spds.uni.kn

Preparations

Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)

## Essential commands | Data science for psychologists
## 2018 06 24
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##

## Preparations: ----- 

library(tidyverse)

## Topic: ----- 

# ...

## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## End of file. ----- 

Tibbles

Whenever working with rectangular data structures – data consisting of multiple cases (rows) and variables (columns) – our first step is to create or transform the data into a tibble (i.e., a simple version of a data frame).

Creating tibbles

Basic commands

There are 3 basic commands for creating tibbles:

  1. as_tibble converts (or coerces) an existing data frame into a tibble.

  2. tibble converts several vectors into (the columns of) a tibble.

  3. tribble converts a table (entered row-by-row) into a tibble.

Check: The 3 commands yield the same type of output (i.e., a tibble), but require different inputs. Ask yourself which kind of input each command takes and how this input needs to be structured and formatted (e.g., with commas).

1. as_tibble

Use as_tibble when the data to be used already is in a data frame (or matrix):

## Using the data frame `sleep`: ------ 

# ?datasets::sleep # provides background information on the data set.

# Save the sleep data frame as df: 
df <- datasets::sleep

# Convert df into a tibble tb: 
tb <- as_tibble(df)

# Inspect the data frame df: 
dim(df)
#> [1] 20  3
is.data.frame(df)
#> [1] TRUE
head(df)
#>   extra group ID
#> 1   0.7     1  1
#> 2  -1.6     1  2
#> 3  -0.2     1  3
#> 4  -1.2     1  4
#> 5  -0.1     1  5
#> 6   3.4     1  6
str(df)
#> 'data.frame':    20 obs. of  3 variables:
#>  $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
#>  $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...

# Inspect the tibble tb:
dim(tb)
#> [1] 20  3
is_tibble(tb)  # is.tibble() is deprecated in newer versions of tibble
#> [1] TRUE
is.data.frame(tb) # => tibbles ARE data frames.
#> [1] TRUE
head(tb)
#> # A tibble: 6 x 3
#>   extra  group     ID
#>   <dbl> <fctr> <fctr>
#> 1   0.7      1      1
#> 2  -1.6      1      2
#> 3  -0.2      1      3
#> 4  -1.2      1      4
#> 5  -0.1      1      5
#> 6   3.4      1      6
glimpse(tb)
#> Observations: 20
#> Variables: 3
#> $ extra <dbl> 0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0, 1....
#> $ group <fctr> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
#> $ ID    <fctr> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, ...

Practice: Convert the data frames datasets::attitude and datasets::iris into tibbles and inspect their dimensions and contents. What types of variables do they contain?
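
Hint: The same conversion also works for matrices and for data frames with row names. The following is a minimal sketch, assuming a made-up matrix m and the built-in data frame datasets::mtcars (neither is used elsewhere in this file). By default, as_tibble drops row names, but its rownames argument can store them in a regular column:

```r
library(tibble)

# A small matrix with column names (made up for illustration):
m  <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("x", "y")))
tm <- as_tibble(m)  # each matrix column becomes a tibble column
dim(tm)

# mtcars stores the car models as row names, which tibbles do not keep:
cars_1 <- as_tibble(mtcars)                      # row names are dropped
cars_2 <- as_tibble(mtcars, rownames = "model")  # row names become the column "model"
names(cars_2)[1]
```

Storing row names in an explicit column keeps this information available for later filtering and sorting.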

2. tibble

Use tibble when the data to be used appears as a collection of columns. For instance, imagine we have the following information about a family:

Example data of some family.
id  name  age  gender  drives  married_2
1   Adam  46   male    TRUE    Eva
2   Eva   48   female  TRUE    Adam
3   Xaxi  21   female  FALSE   Zenon
4   Yota  19   female  TRUE    NA
5   Zack  17   male    FALSE   NA

One way of viewing this table is as a series of columns. Each column consists of a variable name and the same number of (here: 5) values, which can be of different types (here: numbers, characters, or Boolean truth values). Each column may or may not contain missing values (entered as NA).

The tibble command expects that each column of the table is entered as a vector:

## Create a tibble from vectors (column-by-column): 
fm <- tibble(
  id       = c(1, 2, 3, 4, 5), # OR: id = 1:5, 
  name     = c("Adam", "Eva", "Xaxi", "Yota", "Zack"), 
  age      = c(46, 48, 21, 19, 17), 
  gender   = c("male", rep("female", 3), "male"), 
  drives   = c(TRUE, TRUE, FALSE, TRUE, FALSE), 
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)
  )

fm  # prints the tibble: 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

Note some details:

  • Each vector is labeled by the variable (column) name, which is not put into quotes;

  • Avoid spaces within variable (column) names (or enclose names in backticks if you really must use spaces);

  • All vectors need to have the same length (only vectors of length 1 are recycled);

  • Each vector is of a single type (numeric, character, or Boolean truth values);

  • Consecutive vectors are separated by commas (but there is no comma after the final vector).
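
Two of these details can be illustrated by a minimal sketch, using a made-up tibble tb (its column names and values are assumptions for illustration). A name containing a space can be enclosed in backticks, both when creating and when accessing the column. A vector of length 1 is the one exception to the equal-length rule, as it is recycled to the length of the other vectors:

```r
library(tibble)

tb <- tibble(
  `first name` = c("Adam", "Eva"),  # name with a space, enclosed in backticks
  age          = c(46, 48),
  family       = "Example"          # a vector of length 1 is recycled
)

names(tb)         # the column names
tb$`first name`   # accessing the column also requires backticks
tb$family         # the single value was repeated for each row
```

Vectors of any other unequal length (e.g., 2 and 3 values) result in an error rather than being recycled.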

When using tibble, later vectors may use the values of earlier vectors:

# Using earlier vectors when defining later ones:
abc <- tibble(
  ltr = LETTERS[1:5],
  num = 1:5,
  l_n = paste(ltr, num, sep = "_"),  # combining ltr with num
  nsq = num^2                        # squaring num
  )

abc  # prints the tibble: 
#> # A tibble: 5 x 4
#>     ltr   num   l_n   nsq
#>   <chr> <int> <chr> <dbl>
#> 1     A     1   A_1     1
#> 2     B     2   B_2     4
#> 3     C     3   C_3     9
#> 4     D     4   D_4    16
#> 5     E     5   E_5    25

Practice: Find some tabular data online (e.g., on Wikipedia) and enter it as a tibble.

3. tribble

Use tribble when the data to be used appears as a collection of rows (or already is in tabular form).

For instance, when you copy and paste the above family data from an electronic document, it is easy to insert commas between consecutive cell values and use tribble to convert it into a tibble:

## Create a tibble from tabular data (row-by-row): 
fm2 <- tribble(
  ~id, ~name, ~age, ~gender, ~drives, ~married_2,   
  #--|------|-----|--------|----------|----------|
  1,  "Adam", 46,  "male",    TRUE,     "Eva",    
  2,  "Eva",  48,  "female",  TRUE,     "Adam",  
  3,  "Xaxi", 21,  "female",  FALSE,    "Zenon",    
  4,  "Yota", 19,  "female",  TRUE,      NA, 
  5,  "Zack", 17,  "male",    FALSE,     NA      )

fm2  # prints the tibble: 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

Note some details:

  • The column names are preceded by ~;

  • Consecutive entries are separated by a comma (but there is no comma after the final entry);

  • The line #--|------|-----|--------|----------|----------| is commented out and can be omitted;

  • The type of each column is determined by the type of the corresponding cell values. For instance, the NA values in fm2 are missing character values because the entries above were characters (entered in quotes).
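
The last point can be checked by a minimal sketch, using a made-up tibble tr (its names and values are assumptions for illustration). An NA in a column that also contains character values becomes a missing character value, whereas a column consisting only of NA values is of type logical:

```r
library(tibble)

tr <- tribble(
  ~name,  ~spouse, ~note,
  "Adam", "Eva",   NA,
  "Yota", NA,      NA
)

sapply(tr, class)  # column types: character, character, logical
```

To obtain a character column that contains only missing values, the cells can be entered as NA_character_.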

Check: If tibble and tribble really are alternative commands, then the contents of our objects fm and fm2 should be identical:

# Are fm and fm2 equal?
all.equal(fm, fm2)
#> [1] TRUE

Practice: Enter the tibble abc by using tribble.

Accessing parts of a tibble

Once we have an R object that is a tibble, we often want to access individual parts of it. We can distinguish between 3 simple cases:

1. Variables (columns)

As each column of a tibble is a vector, obtaining a column amounts to obtaining the corresponding vector. We can access this vector by its name (label) or by its number (column position):

fm  # family tibble (defined above): 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

# Get the name column of fm:
fm$name       # by label (with $)
#> [1] "Adam" "Eva"  "Xaxi" "Yota" "Zack"
fm[["name"]]  # by label (with [[ ]])
#> [1] "Adam" "Eva"  "Xaxi" "Yota" "Zack"
fm[[2]]       # by number (with [[ ]])
#> [1] "Adam" "Eva"  "Xaxi" "Yota" "Zack"

# Get the age column of fm: 
fm$age        # by name (with $)
#> [1] 46 48 21 19 17
fm[["age"]]   # by name (with [[ ]])
#> [1] 46 48 21 19 17
fm[[3]]       # by number (with [[ ]])
#> [1] 46 48 21 19 17

# Note: The following commands all return the same columns, but as (5 x 1) tibbles rather than as vectors:
fm[ , 2] # yields the name vector as a (5 x 1) tibble
#> # A tibble: 5 x 1
#>    name
#>   <chr>
#> 1  Adam
#> 2   Eva
#> 3  Xaxi
#> 4  Yota
#> 5  Zack
select(fm, 2) 
#> # A tibble: 5 x 1
#>    name
#>   <chr>
#> 1  Adam
#> 2   Eva
#> 3  Xaxi
#> 4  Yota
#> 5  Zack
select(fm, name)
#> # A tibble: 5 x 1
#>    name
#>   <chr>
#> 1  Adam
#> 2   Eva
#> 3  Xaxi
#> 4  Yota
#> 5  Zack

fm[ , 3] # yields the age vector as a (5 x 1) tibble
#> # A tibble: 5 x 1
#>     age
#>   <dbl>
#> 1    46
#> 2    48
#> 3    21
#> 4    19
#> 5    17
select(fm, 3)
#> # A tibble: 5 x 1
#>     age
#>   <dbl>
#> 1    46
#> 2    48
#> 3    21
#> 4    19
#> 5    17
select(fm, age)
#> # A tibble: 5 x 1
#>     age
#>   <dbl>
#> 1    46
#> 2    48
#> 3    21
#> 4    19
#> 5    17

Practice: Extract the price column of ggplot2::diamonds in at least 3 different ways and verify that they all yield the same mean price.
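
Hint: Whether a command returns a vector or a tibble matters when passing a column to a function that expects a vector (like mean). The following is a minimal sketch, assuming a made-up two-row tibble d. The operators $ and [[ ]] and the function dplyr::pull return a vector, whereas [ ] and select return a tibble:

```r
library(dplyr)

d <- tibble(name = c("Adam", "Eva"), age = c(46, 48))  # made-up data

v_1 <- d$age           # vector
v_2 <- d[["age"]]      # vector
v_3 <- pull(d, age)    # vector (pipe-friendly)
t_1 <- d[ , "age"]     # 2 x 1 tibble
t_2 <- select(d, age)  # 2 x 1 tibble

mean(v_1)        # works on a vector
is.numeric(t_1)  # FALSE: a tibble is a list of columns, not a numeric vector
```

Within a pipe, pull is a convenient last step for turning a column into a vector.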

2. Cases (rows)

Extracting specific rows of a tibble amounts to filtering a tibble and typically yields smaller tibbles (as a row may contain entries of different types). The best way of filtering specific rows of a tibble is using dplyr::filter. However, it’s also possible to specify the desired rows by logical subsetting (i.e., providing a condition that yields a vector of Boolean values, one per row) or by row number:

fm  # family tibble (defined above): 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

# Filter specific rows (by condition):
filter(fm, id > 2)
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     3  Xaxi    21 female  FALSE     Zenon
#> 2     4  Yota    19 female   TRUE      <NA>
#> 3     5  Zack    17   male  FALSE      <NA>
filter(fm, age < 18)
#> # A tibble: 1 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     5  Zack    17   male  FALSE      <NA>
fm %>% filter(drives == TRUE) 
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     4  Yota    19 female   TRUE      <NA>
  
# The same filters by using Boolean vectors (subsetting):
fm[fm$id > 2, ]
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     3  Xaxi    21 female  FALSE     Zenon
#> 2     4  Yota    19 female   TRUE      <NA>
#> 3     5  Zack    17   male  FALSE      <NA>
fm[fm$age < 18, ]
#> # A tibble: 1 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     5  Zack    17   male  FALSE      <NA>
fm[fm$drives == TRUE, ]
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     4  Yota    19 female   TRUE      <NA>

# The same filters by providing specific row numbers:
fm[3:5, ]  # getting rows 3 to 5 of fm
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     3  Xaxi    21 female  FALSE     Zenon
#> 2     4  Yota    19 female   TRUE      <NA>
#> 3     5  Zack    17   male  FALSE      <NA>
fm[5, ]    # getting row 5 of fm
#> # A tibble: 1 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     5  Zack    17   male  FALSE      <NA>
fm[c(1, 2, 4), ]  # getting rows 1, 2, and 4 of fm
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     4  Yota    19 female   TRUE      <NA>

Practice: Extract all diamonds from ggplot2::diamonds that have at least 2 carat. How many of them are there and what is their mean price?

3. Cells

Accessing the values of individual tibble cells is relatively rare, but can be achieved by

  a. explicitly providing both row number r and column number c (as [r, c]), or by

  b. first extracting the column (as a vector v) and then providing the desired row number r (as v[r]).

fm  # family tibble (defined above): 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

# Getting specific cell values:
fm$name[4]  # getting the name of the 4th row
#> [1] "Yota"
fm[4, 2]    # getting the same name by row and column numbers
#> # A tibble: 1 x 1
#>    name
#>   <chr>
#> 1  Yota

# Note: What if we don't know the row number? 
which(fm$name == "Yota") # getting the row number that contains the name "Yota"
#> [1] 4

In practice, accessing individual cell values is mostly needed to check for specific cell values and to change or correct erroneous entries by re-assigning them to a different value.

# Checking and changing cell values:

# Check: "Who is Xaxi's spouse?" in 3 different ways:
fm[fm$name == "Xaxi", ]$married_2
#> [1] "Zenon"
fm$married_2[3]
#> [1] "Zenon"
fm[3, 6]
#> # A tibble: 1 x 1
#>   married_2
#>       <chr>
#> 1     Zenon

# Change: Replace "Zenon" by "Zeus" in 3 different ways (each one suffices):
fm[fm$name == "Xaxi", ]$married_2 <- "Zeus"
fm$married_2[3] <- "Zeus"
fm[3, 6] <- "Zeus"

# Check for successful change:
fm
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE      Zeus
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

By contrast, a relatively common task is to check an entire tibble for missing values, count them, or replace them by some other value:

# Checking for, counting, and changing missing values:

fm  # family tibble (defined above): 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE      Zeus
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

# (a) Check for missing values:
is.na(fm)       # checks each cell value for being NA
#>         id  name   age gender drives married_2
#> [1,] FALSE FALSE FALSE  FALSE  FALSE     FALSE
#> [2,] FALSE FALSE FALSE  FALSE  FALSE     FALSE
#> [3,] FALSE FALSE FALSE  FALSE  FALSE     FALSE
#> [4,] FALSE FALSE FALSE  FALSE  FALSE      TRUE
#> [5,] FALSE FALSE FALSE  FALSE  FALSE      TRUE

# (b) Count the number of missing values: 
sum(is.na(fm))  # counts missing values (by adding up all TRUE values)
#> [1] 2

# (c) Change all missing values: 
fm[is.na(fm)] <- "A MISSING value!"

# Check for successful change: 
fm
#> # A tibble: 5 x 6
#>      id  name   age gender drives        married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>            <chr>
#> 1     1  Adam    46   male   TRUE              Eva
#> 2     2   Eva    48 female   TRUE             Adam
#> 3     3  Xaxi    21 female  FALSE             Zeus
#> 4     4  Yota    19 female   TRUE A MISSING value!
#> 5     5  Zack    17   male  FALSE A MISSING value!

Practice: Determine the number and the percentage of missing values in the datasets dplyr::starwars and dplyr::storms.
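
Hint: Missing values can also be counted per column and expressed as a proportion of all cells. The following is a minimal sketch, assuming a made-up tibble tb with 3 missing values in 15 cells (the names and values are assumptions for illustration). In contrast to replacing all NA values of a tibble at once, it only replaces the missing values of one column:

```r
library(tibble)

tb <- tibble(
  name   = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age    = c(46, 48, NA, 19, 17),
  spouse = c("Eva", "Adam", "Zenon", NA, NA)
)

na_per_col <- colSums(is.na(tb))  # number of NA values in each column
na_share   <- mean(is.na(tb))     # proportion of NA values among all cells
na_per_col
na_share * 100                    # as a percentage

# Replace the missing values of one column only:
tb$spouse[is.na(tb$spouse)] <- "unknown"
```

Restricting the replacement to a character column avoids assigning text to columns of other types (here: the numeric column age, which keeps its NA).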

More advanced operations on tibbles are covered in Chapter 5: Data transformation and involve using the dplyr commands arrange, filter, and select.

More on tibbles

For more details on tibbles,

Data transformation

Overview

When we have data in the form of a tibble or data frame, dplyr provides a range of simple tools to transform this data. Six essential dplyr commands are:

  1. arrange sorts cases (rows);
  2. filter selects cases (rows) by logical conditions;
  3. select selects and reorders variables (columns);
  4. mutate computes new variables (columns) and adds them to existing ones;
  5. summarise collapses multiple values of a variable (rows of a column) to a single one;
  6. group_by changes the unit of aggregation (in combination with mutate and summarise).

Not quite as essential but still useful dplyr commands include:

  • slice selects (ranges of) cases (rows) by number;
  • rename renames variables (columns) and keeps all others;
  • transmute computes new variables (columns) and drops existing ones;
  • sample_n and sample_frac draw random samples of cases (rows).
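
These commands are typically combined into a pipe. The following is a minimal sketch on a made-up tibble d (its names and values are assumptions for illustration, loosely modelled on the starwars data). It filters rows, computes a new variable, and then summarises this variable by group:

```r
library(dplyr)

d <- tibble(
  name    = c("A", "B", "C", "D"),
  species = c("Human", "Human", "Droid", "Droid"),
  height  = c(172, 150, 167, 96),  # in cm
  mass    = c(77, 49, 75, 32)      # in kg
)

res <- d %>%
  filter(height > 100) %>%                    # drops D (height 96)
  mutate(bmi = mass / (height / 100)^2) %>%   # adds a new column bmi
  group_by(species) %>%                       # one group per species
  summarise(n = n(),                          # cases per group
            mean_bmi = mean(bmi)) %>%         # mean bmi per group
  arrange(desc(n))                            # largest group first

res
```

Each step receives the result of the previous step as its first argument, so the pipe can be read from top to bottom.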

Commands and examples

We save the dplyr::starwars data as a tibble sw and use it to illustrate the essential dplyr commands.

library(tidyverse)
sw <- dplyr::starwars

sw  # => A tibble: 87 rows (individuals) x 13 columns (variables)
#> # A tibble: 87 x 13
#>                  name height  mass    hair_color  skin_color eye_color
#>                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
#>  1     Luke Skywalker    172    77         blond        fair      blue
#>  2              C-3PO    167    75          <NA>        gold    yellow
#>  3              R2-D2     96    32          <NA> white, blue       red
#>  4        Darth Vader    202   136          none       white    yellow
#>  5        Leia Organa    150    49         brown       light     brown
#>  6          Owen Lars    178   120   brown, grey       light      blue
#>  7 Beru Whitesun lars    165    75         brown       light      blue
#>  8              R5-D4     97    32          <NA>  white, red       red
#>  9  Biggs Darklighter    183    84         black       light     brown
#> 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Practice: How many sw variables (columns) are there and of which type are they? How many missing (NA) values are there?

1. arrange to sort rows

Using arrange sorts cases (rows) by putting specific variables (columns) in specific orders (e.g., ascending or descending):

# Sort rows alphabetically (by name):
arrange(sw, name)
#> # A tibble: 87 x 13
#>                   name height  mass hair_color          skin_color
#>                  <chr>  <int> <dbl>      <chr>               <chr>
#>  1              Ackbar    180    83       none        brown mottle
#>  2          Adi Gallia    184    50       none                dark
#>  3    Anakin Skywalker    188    84      blond                fair
#>  4        Arvel Crynyd     NA    NA      brown                fair
#>  5         Ayla Secura    178    55       none                blue
#>  6 Bail Prestor Organa    191    NA      black                 tan
#>  7       Barriss Offee    166    50      black              yellow
#>  8                 BB8     NA    NA       none                none
#>  9      Ben Quadinaros    163    65       none grey, green, yellow
#> 10  Beru Whitesun lars    165    75      brown               light
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> #   birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

# The same command using the pipe:
sw %>%           # Note: %>% is NOT + (used in ggplot) 
  arrange(name) 
#> # A tibble: 87 x 13
#>                   name height  mass hair_color          skin_color
#>                  <chr>  <int> <dbl>      <chr>               <chr>
#>  1              Ackbar    180    83       none        brown mottle
#>  2          Adi Gallia    184    50       none                dark
#>  3    Anakin Skywalker    188    84      blond                fair
#>  4        Arvel Crynyd     NA    NA      brown                fair
#>  5         Ayla Secura    178    55       none                blue
#>  6 Bail Prestor Organa    191    NA      black                 tan
#>  7       Barriss Offee    166    50      black              yellow
#>  8                 BB8     NA    NA       none                none
#>  9      Ben Quadinaros    163    65       none grey, green, yellow
#> 10  Beru Whitesun lars    165    75      brown               light
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> #   birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

# Sort rows in descending order:
sw %>% 
  arrange(desc(name)) 
#> # A tibble: 87 x 13
#>                     name height  mass   hair_color          skin_color
#>                    <chr>  <int> <dbl>        <chr>               <chr>
#>  1            Zam Wesell    168    55       blonde fair, green, yellow
#>  2                  Yoda     66    17        white               green
#>  3           Yarael Poof    264    NA         none               white
#>  4        Wilhuff Tarkin    180    NA auburn, grey                fair
#>  5 Wicket Systri Warrick     88    20        brown               brown
#>  6        Wedge Antilles    170    77        brown                fair
#>  7                 Watto    137    NA        black          blue, grey
#>  8            Wat Tambor    193    48         none         green, grey
#>  9            Tion Medon    206    80         none                grey
#> 10               Taun We    213    NA         none                grey
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> #   birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

# Sort by multiple variables:
sw %>% 
  arrange(eye_color, gender, desc(height))
#> # A tibble: 87 x 13
#>          name height  mass hair_color       skin_color eye_color
#>         <chr>  <int> <dbl>      <chr>            <chr>     <chr>
#>  1    Taun We    213    NA       none             grey     black
#>  2   Shaak Ti    178    57       none red, blue, white     black
#>  3    Lama Su    229    88       none             grey     black
#>  4 Tion Medon    206    80       none             grey     black
#>  5  Kit Fisto    196    87       none            green     black
#>  6   Plo Koon    188    80       none           orange     black
#>  7     Greedo    173    74       <NA>            green     black
#>  8  Nien Nunb    160    68       none             grey     black
#>  9    Gasgano    122    NA       none      white, blue     black
#> 10        BB8     NA    NA       none             none     black
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

## Note: See 
# ?dplyr::arrange  # for more help and examples.

Note some details:

  • All basic dplyr commands can be called as verb(data, ...) or – using the pipe from magrittr – as data %>% verb(...) (see vignette("magrittr") for details).

  • Variable names are unquoted.

  • The order of variable names (x, y, ...) specifies the order or priority of operations (first by x, then by y, etc.).
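
The priority of sorting variables can be illustrated by a minimal sketch on a made-up tibble d (its names and values are assumptions for illustration). Rows are first sorted by grp and then, within each value of grp, by descending val. A missing value is placed last within its group, even when sorting in descending order:

```r
library(dplyr)

d <- tibble(
  grp = c("b", "a", "b", "a"),
  val = c(1, NA, 3, 2)
)

srt <- arrange(d, grp, desc(val))  # first by grp, then by descending val
srt$grp  # "a" "a" "b" "b"
srt$val  # 2 NA 3 1
```

Swapping the order of the arguments (i.e., arrange(d, desc(val), grp)) would sort primarily by val and use grp only to break ties.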

Practice: Arrange the sw data in different ways, combining multiple variables and (ascending and descending) orders. Where are cases containing NA values in sorted variables placed?

2. filter to select rows

Using filter selects cases (rows) by logical conditions. It keeps all rows for which the conditions are TRUE and drops all rows for which the conditions are FALSE or NA.

# Filter to keep all humans:
filter(sw, species == "Human")
#> # A tibble: 35 x 13
#>                  name height  mass    hair_color skin_color eye_color
#>                 <chr>  <int> <dbl>         <chr>      <chr>     <chr>
#>  1     Luke Skywalker    172    77         blond       fair      blue
#>  2        Darth Vader    202   136          none      white    yellow
#>  3        Leia Organa    150    49         brown      light     brown
#>  4          Owen Lars    178   120   brown, grey      light      blue
#>  5 Beru Whitesun lars    165    75         brown      light      blue
#>  6  Biggs Darklighter    183    84         black      light     brown
#>  7     Obi-Wan Kenobi    182    77 auburn, white       fair blue-gray
#>  8   Anakin Skywalker    188    84         blond       fair      blue
#>  9     Wilhuff Tarkin    180    NA  auburn, grey       fair      blue
#> 10           Han Solo    180    80         brown       fair     brown
#> # ... with 25 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# The same command using the pipe:
sw %>%           # Note: %>% is NOT + (used in ggplot) 
  filter(species == "Human")
#> # A tibble: 35 x 13
#>                  name height  mass    hair_color skin_color eye_color
#>                 <chr>  <int> <dbl>         <chr>      <chr>     <chr>
#>  1     Luke Skywalker    172    77         blond       fair      blue
#>  2        Darth Vader    202   136          none      white    yellow
#>  3        Leia Organa    150    49         brown      light     brown
#>  4          Owen Lars    178   120   brown, grey      light      blue
#>  5 Beru Whitesun lars    165    75         brown      light      blue
#>  6  Biggs Darklighter    183    84         black      light     brown
#>  7     Obi-Wan Kenobi    182    77 auburn, white       fair blue-gray
#>  8   Anakin Skywalker    188    84         blond       fair      blue
#>  9     Wilhuff Tarkin    180    NA  auburn, grey       fair      blue
#> 10           Han Solo    180    80         brown       fair     brown
#> # ... with 25 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Filter by multiple (additive) conditions: 
sw %>%
  filter(height > 180, mass <= 75)  # tall and light individuals
#> # A tibble: 3 x 13
#>            name height  mass hair_color  skin_color eye_color birth_year
#>           <chr>  <int> <dbl>      <chr>       <chr>     <chr>      <dbl>
#> 1 Jar Jar Binks    196    66       none      orange    orange         52
#> 2    Adi Gallia    184    50       none        dark      blue         NA
#> 3    Wat Tambor    193    48       none green, grey   unknown         NA
#> # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

# The same command using the logical operator (&): 
sw %>%
  filter(height > 180 & mass <= 75)  # tall and light individuals
#> # A tibble: 3 x 13
#>            name height  mass hair_color  skin_color eye_color birth_year
#>           <chr>  <int> <dbl>      <chr>       <chr>     <chr>      <dbl>
#> 1 Jar Jar Binks    196    66       none      orange    orange         52
#> 2    Adi Gallia    184    50       none        dark      blue         NA
#> 3    Wat Tambor    193    48       none green, grey   unknown         NA
#> # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

# Filter for a range of a specific variable:
sw %>%
  filter(height >= 150, height <= 165)  # (a) using height twice
#> # A tibble: 9 x 13
#>                 name height  mass hair_color          skin_color eye_color
#>                <chr>  <int> <dbl>      <chr>               <chr>     <chr>
#> 1        Leia Organa    150    49      brown               light     brown
#> 2 Beru Whitesun lars    165    75      brown               light      blue
#> 3         Mon Mothma    150    NA     auburn                fair      blue
#> 4          Nien Nunb    160    68       none                grey     black
#> 5     Shmi Skywalker    163    NA      black                fair     brown
#> 6     Ben Quadinaros    163    65       none grey, green, yellow    orange
#> 7              Cordé    157    NA      brown               light     brown
#> 8              Dormé    165    NA      brown               light     brown
#> 9      Padmé Amidala    165    45      brown               light     brown
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> #   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> #   starships <list>

sw %>%
  filter(between(height, 150, 165))     # (b) using between(...)
#> # A tibble: 9 x 13
#>                 name height  mass hair_color          skin_color eye_color
#>                <chr>  <int> <dbl>      <chr>               <chr>     <chr>
#> 1        Leia Organa    150    49      brown               light     brown
#> 2 Beru Whitesun lars    165    75      brown               light      blue
#> 3         Mon Mothma    150    NA     auburn                fair      blue
#> 4          Nien Nunb    160    68       none                grey     black
#> 5     Shmi Skywalker    163    NA      black                fair     brown
#> 6     Ben Quadinaros    163    65       none grey, green, yellow    orange
#> 7              Cordé    157    NA      brown               light     brown
#> 8              Dormé    165    NA      brown               light     brown
#> 9      Padmé Amidala    165    45      brown               light     brown
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> #   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> #   starships <list>

# Filter by multiple (alternative) conditions: 
sw %>%
  filter(homeworld == "Kashyyyk" | skin_color == "green")
#> # A tibble: 8 x 13
#>                name height  mass hair_color skin_color eye_color
#>               <chr>  <int> <dbl>      <chr>      <chr>     <chr>
#> 1         Chewbacca    228   112      brown    unknown      blue
#> 2            Greedo    173    74       <NA>      green     black
#> 3              Yoda     66    17      white      green     brown
#> 4             Bossk    190   113       none      green       red
#> 5        Rugor Nass    206    NA       none      green    orange
#> 6         Kit Fisto    196    87       none      green     black
#> 7 Poggle the Lesser    183    80       none      green    yellow
#> 8           Tarfful    234   136      brown      brown      blue
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> #   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> #   starships <list>

# Filter cases with missing (NA) values on specific variables:
sw %>%
  filter(is.na(gender))
#> # A tibble: 3 x 13
#>    name height  mass hair_color  skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl>      <chr>       <chr>     <chr>      <dbl>  <chr>
#> 1 C-3PO    167    75       <NA>        gold    yellow        112   <NA>
#> 2 R2-D2     96    32       <NA> white, blue       red         33   <NA>
#> 3 R5-D4     97    32       <NA>  white, red       red         NA   <NA>
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Filter cases with existing (non-NA) values on specific variables:
sw %>%
  filter(!is.na(mass), !is.na(birth_year))
#> # A tibble: 36 x 13
#>                  name height  mass    hair_color  skin_color eye_color
#>                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
#>  1     Luke Skywalker    172    77         blond        fair      blue
#>  2              C-3PO    167    75          <NA>        gold    yellow
#>  3              R2-D2     96    32          <NA> white, blue       red
#>  4        Darth Vader    202   136          none       white    yellow
#>  5        Leia Organa    150    49         brown       light     brown
#>  6          Owen Lars    178   120   brown, grey       light      blue
#>  7 Beru Whitesun lars    165    75         brown       light      blue
#>  8  Biggs Darklighter    183    84         black       light     brown
#>  9     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
#> 10   Anakin Skywalker    188    84         blond        fair      blue
#> # ... with 26 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

## Note: See 
# ?dplyr::filter  # for more help and examples.

Note some details:

  • Separating multiple conditions (or tests) by commas means the same as combining them with & (logical AND), as each test results in a vector of Boolean values.

  • Variable names are unquoted.

  • Unlike in base R, rows for which the condition evaluates to NA are dropped.

  • Additional filter functions include near() for testing numerical (near-)identity.
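
The details on commas, NA values, and near() can be checked on a small example. The following is a minimal sketch that assumes the dplyr package is installed. The data frame toy and its values are made up for illustration and are not part of the starwars data:

```r
library(dplyr)

# Hypothetical toy data (made up for illustration):
toy <- data.frame(x = c(1, 2, NA),
                  y = c(5, 6, 7))

# Separating conditions by a comma is the same as combining them with &:
by_comma <- filter(toy, x > 0, y > 5)
by_and   <- filter(toy, x > 0 & y > 5)
all.equal(by_comma, by_and)  # TRUE

# filter() drops rows for which the condition evaluates to NA,
# whereas base R subsetting returns a row of NAs for them:
n_dplyr <- nrow(filter(toy, x > 1))  # 1 row  (x = 2)
n_base  <- nrow(toy[toy$x > 1, ])    # 2 rows (x = 2 and a row of NAs)

# near() compares numbers with a small tolerance, which avoids
# surprises caused by floating point arithmetic:
exact  <- sqrt(2)^2 == 2      # FALSE
approx <- near(sqrt(2)^2, 2)  # TRUE
```

To keep cases with missing values on purpose, add an explicit test (e.g., filter(toy, x > 1 | is.na(x))).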

Practice: Use filter on sw to select very diverse or narrow subsets of individuals. For instance,

  • which individual with blond hair and blue eyes has an unknown mass?
  • to which species do individuals belong that are over 2m tall and have brown hair?
  • which individuals from Tatooine are not male (but may be NA)?
  • which individuals are neither male nor female OR heavier than 130kg?

3. select to select columns

Using select selects variables (columns) by their names or numbers:

# Select 4 specific variables (columns) of sw:
select(sw, name, species, birth_year, gender)
#> # A tibble: 87 x 4
#>                  name species birth_year gender
#>                 <chr>   <chr>      <dbl>  <chr>
#>  1     Luke Skywalker   Human       19.0   male
#>  2              C-3PO   Droid      112.0   <NA>
#>  3              R2-D2   Droid       33.0   <NA>
#>  4        Darth Vader   Human       41.9   male
#>  5        Leia Organa   Human       19.0 female
#>  6          Owen Lars   Human       52.0   male
#>  7 Beru Whitesun lars   Human       47.0 female
#>  8              R5-D4   Droid         NA   <NA>
#>  9  Biggs Darklighter   Human       24.0   male
#> 10     Obi-Wan Kenobi   Human       57.0   male
#> # ... with 77 more rows

# The same when using the pipe:
sw %>%           # Note: %>% is NOT + (used in ggplot) 
  select(name, species, birth_year, gender)
#> # A tibble: 87 x 4
#>                  name species birth_year gender
#>                 <chr>   <chr>      <dbl>  <chr>
#>  1     Luke Skywalker   Human       19.0   male
#>  2              C-3PO   Droid      112.0   <NA>
#>  3              R2-D2   Droid       33.0   <NA>
#>  4        Darth Vader   Human       41.9   male
#>  5        Leia Organa   Human       19.0 female
#>  6          Owen Lars   Human       52.0   male
#>  7 Beru Whitesun lars   Human       47.0 female
#>  8              R5-D4   Droid         NA   <NA>
#>  9  Biggs Darklighter   Human       24.0   male
#> 10     Obi-Wan Kenobi   Human       57.0   male
#> # ... with 77 more rows

# The same when providing a vector of variable names: 
sw %>%
  select(c(name, species, birth_year, gender)) 
#> # A tibble: 87 x 4
#>                  name species birth_year gender
#>                 <chr>   <chr>      <dbl>  <chr>
#>  1     Luke Skywalker   Human       19.0   male
#>  2              C-3PO   Droid      112.0   <NA>
#>  3              R2-D2   Droid       33.0   <NA>
#>  4        Darth Vader   Human       41.9   male
#>  5        Leia Organa   Human       19.0 female
#>  6          Owen Lars   Human       52.0   male
#>  7 Beru Whitesun lars   Human       47.0 female
#>  8              R5-D4   Droid         NA   <NA>
#>  9  Biggs Darklighter   Human       24.0   male
#> 10     Obi-Wan Kenobi   Human       57.0   male
#> # ... with 77 more rows

# The same when providing column numbers:
sw %>%
  select(1, 10, 7, 8) 
#> # A tibble: 87 x 4
#>                  name species birth_year gender
#>                 <chr>   <chr>      <dbl>  <chr>
#>  1     Luke Skywalker   Human       19.0   male
#>  2              C-3PO   Droid      112.0   <NA>
#>  3              R2-D2   Droid       33.0   <NA>
#>  4        Darth Vader   Human       41.9   male
#>  5        Leia Organa   Human       19.0 female
#>  6          Owen Lars   Human       52.0   male
#>  7 Beru Whitesun lars   Human       47.0 female
#>  8              R5-D4   Droid         NA   <NA>
#>  9  Biggs Darklighter   Human       24.0   male
#> 10     Obi-Wan Kenobi   Human       57.0   male
#> # ... with 77 more rows

# The same when providing a vector of column numbers: 
sw %>%
  select(c(1, 10, 7, 8)) 
#> # A tibble: 87 x 4
#>                  name species birth_year gender
#>                 <chr>   <chr>      <dbl>  <chr>
#>  1     Luke Skywalker   Human       19.0   male
#>  2              C-3PO   Droid      112.0   <NA>
#>  3              R2-D2   Droid       33.0   <NA>
#>  4        Darth Vader   Human       41.9   male
#>  5        Leia Organa   Human       19.0 female
#>  6          Owen Lars   Human       52.0   male
#>  7 Beru Whitesun lars   Human       47.0 female
#>  8              R5-D4   Droid         NA   <NA>
#>  9  Biggs Darklighter   Human       24.0   male
#> 10     Obi-Wan Kenobi   Human       57.0   male
#> # ... with 77 more rows

# Select ranges of variables with ":":
sw %>%
  select(name:mass, films:starships)
#> # A tibble: 87 x 6
#>                  name height  mass     films  vehicles starships
#>                 <chr>  <int> <dbl>    <list>    <list>    <list>
#>  1     Luke Skywalker    172    77 <chr [5]> <chr [2]> <chr [2]>
#>  2              C-3PO    167    75 <chr [6]> <chr [0]> <chr [0]>
#>  3              R2-D2     96    32 <chr [7]> <chr [0]> <chr [0]>
#>  4        Darth Vader    202   136 <chr [4]> <chr [0]> <chr [1]>
#>  5        Leia Organa    150    49 <chr [5]> <chr [1]> <chr [0]>
#>  6          Owen Lars    178   120 <chr [3]> <chr [0]> <chr [0]>
#>  7 Beru Whitesun lars    165    75 <chr [3]> <chr [0]> <chr [0]>
#>  8              R5-D4     97    32 <chr [1]> <chr [0]> <chr [0]>
#>  9  Biggs Darklighter    183    84 <chr [1]> <chr [0]> <chr [1]>
#> 10     Obi-Wan Kenobi    182    77 <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows

# Select to re-order variables (columns) with everything():
sw %>%
  select(species, name, gender, everything())
#> # A tibble: 87 x 13
#>    species               name gender height  mass    hair_color
#>      <chr>              <chr>  <chr>  <int> <dbl>         <chr>
#>  1   Human     Luke Skywalker   male    172    77         blond
#>  2   Droid              C-3PO   <NA>    167    75          <NA>
#>  3   Droid              R2-D2   <NA>     96    32          <NA>
#>  4   Human        Darth Vader   male    202   136          none
#>  5   Human        Leia Organa female    150    49         brown
#>  6   Human          Owen Lars   male    178   120   brown, grey
#>  7   Human Beru Whitesun lars female    165    75         brown
#>  8   Droid              R5-D4   <NA>     97    32          <NA>
#>  9   Human  Biggs Darklighter   male    183    84         black
#> 10   Human     Obi-Wan Kenobi   male    182    77 auburn, white
#> # ... with 77 more rows, and 7 more variables: skin_color <chr>,
#> #   eye_color <chr>, birth_year <dbl>, homeworld <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Select variables with helper functions:
sw %>%
  select(starts_with("s"))
#> # A tibble: 87 x 3
#>     skin_color species starships
#>          <chr>   <chr>    <list>
#>  1        fair   Human <chr [2]>
#>  2        gold   Droid <chr [0]>
#>  3 white, blue   Droid <chr [0]>
#>  4       white   Human <chr [1]>
#>  5       light   Human <chr [0]>
#>  6       light   Human <chr [0]>
#>  7       light   Human <chr [0]>
#>  8  white, red   Droid <chr [0]>
#>  9       light   Human <chr [1]>
#> 10        fair   Human <chr [5]>
#> # ... with 77 more rows

sw %>%
  select(ends_with("s"))
#> # A tibble: 87 x 5
#>     mass species     films  vehicles starships
#>    <dbl>   <chr>    <list>    <list>    <list>
#>  1    77   Human <chr [5]> <chr [2]> <chr [2]>
#>  2    75   Droid <chr [6]> <chr [0]> <chr [0]>
#>  3    32   Droid <chr [7]> <chr [0]> <chr [0]>
#>  4   136   Human <chr [4]> <chr [0]> <chr [1]>
#>  5    49   Human <chr [5]> <chr [1]> <chr [0]>
#>  6   120   Human <chr [3]> <chr [0]> <chr [0]>
#>  7    75   Human <chr [3]> <chr [0]> <chr [0]>
#>  8    32   Droid <chr [1]> <chr [0]> <chr [0]>
#>  9    84   Human <chr [1]> <chr [0]> <chr [1]>
#> 10    77   Human <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows

sw %>%
  select(contains("_"))
#> # A tibble: 87 x 4
#>       hair_color  skin_color eye_color birth_year
#>            <chr>       <chr>     <chr>      <dbl>
#>  1         blond        fair      blue       19.0
#>  2          <NA>        gold    yellow      112.0
#>  3          <NA> white, blue       red       33.0
#>  4          none       white    yellow       41.9
#>  5         brown       light     brown       19.0
#>  6   brown, grey       light      blue       52.0
#>  7         brown       light      blue       47.0
#>  8          <NA>  white, red       red         NA
#>  9         black       light     brown       24.0
#> 10 auburn, white        fair blue-gray       57.0
#> # ... with 77 more rows

sw %>%
  select(matches("or"))
#> # A tibble: 87 x 4
#>       hair_color  skin_color eye_color homeworld
#>            <chr>       <chr>     <chr>     <chr>
#>  1         blond        fair      blue  Tatooine
#>  2          <NA>        gold    yellow  Tatooine
#>  3          <NA> white, blue       red     Naboo
#>  4          none       white    yellow  Tatooine
#>  5         brown       light     brown  Alderaan
#>  6   brown, grey       light      blue  Tatooine
#>  7         brown       light      blue  Tatooine
#>  8          <NA>  white, red       red  Tatooine
#>  9         black       light     brown  Tatooine
#> 10 auburn, white        fair blue-gray   Stewjon
#> # ... with 77 more rows

# Renaming variables:
sw %>%
  rename(creature = name, from_planet = homeworld)
#> # A tibble: 87 x 13
#>              creature height  mass    hair_color  skin_color eye_color
#>                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
#>  1     Luke Skywalker    172    77         blond        fair      blue
#>  2              C-3PO    167    75          <NA>        gold    yellow
#>  3              R2-D2     96    32          <NA> white, blue       red
#>  4        Darth Vader    202   136          none       white    yellow
#>  5        Leia Organa    150    49         brown       light     brown
#>  6          Owen Lars    178   120   brown, grey       light      blue
#>  7 Beru Whitesun lars    165    75         brown       light      blue
#>  8              R5-D4     97    32          <NA>  white, red       red
#>  9  Biggs Darklighter    183    84         black       light     brown
#> 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, from_planet <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

## Note: See 
# ?dplyr::select  # for more help and examples. 
# ?dplyr::select_if  # for more help and examples. 

Note some details:

  • select works both by specifying variable (column) names and by specifying column numbers.

  • Variable names are unquoted.

  • The sequence of variable names (separated by commas) specifies the order of columns in the resulting tibble.

  • Selecting and adding everything() allows re-ordering.

  • Various helper functions (e.g., starts_with, ends_with, contains, matches, num_range) refer to (parts of) variable names.

  • rename renames specified variables (without quotes) and keeps all other variables.
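
The helper num_range and the related command select_if are mentioned above, but not shown in action. The following is a minimal sketch that assumes the dplyr package is installed. The tibble toy and its variable names are made up for illustration:

```r
library(dplyr)

# Hypothetical toy tibble (made up for illustration):
toy <- tibble(id = c("a", "b"),
              x1 = c(1, 2),
              x2 = c(3, 4),
              x3 = c(5, 6))

# num_range() matches names that consist of a prefix and a number:
sel_range <- select(toy, num_range("x", 1:2))
names(sel_range)  # "x1" "x2"

# The order of arguments determines the order of columns,
# and everything() adds all remaining columns:
sel_order <- select(toy, x3, everything())
names(sel_order)  # "x3" "id" "x1" "x2"

# rename() changes names, but keeps all variables:
renamed <- rename(toy, label = id)
names(renamed)  # "label" "x1" "x2" "x3"

# select_if() selects columns by a condition on their contents:
sel_num <- select_if(toy, is.numeric)
names(sel_num)  # "x1" "x2" "x3"
```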

Practice: Use select on sw to select and re-order specific subsets of variables (e.g., all variables starting with “h”, all even columns, all character variables, etc.).

4. mutate to compute new variables

Using mutate computes new variables (columns) from scratch or existing ones:

# Preparation: Save only a subset of the variables of sw as sws:   
sws <- select(sw, name:mass, birth_year:species) 
sws    # => 87 cases (rows), but only 7 variables (columns)
#> # A tibble: 87 x 7
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows

# Compute 2 new variables and add them to existing ones:
mutate(sws, id = 1:nrow(sws), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows, and 2 more variables: id <int>, height_feet <dbl>

# The same using the pipe:
sws %>%
  mutate(id = 1:nrow(sws), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows, and 2 more variables: id <int>, height_feet <dbl>

# Transmute computes and only keeps new variables:
sws %>%
  transmute(id = 1:nrow(sws), height_feet = .032808399 * height)
#> # A tibble: 87 x 2
#>       id height_feet
#>    <int>       <dbl>
#>  1     1    5.643045
#>  2     2    5.479003
#>  3     3    3.149606
#>  4     4    6.627297
#>  5     5    4.921260
#>  6     6    5.839895
#>  7     7    5.413386
#>  8     8    3.182415
#>  9     9    6.003937
#> 10    10    5.971129
#> # ... with 77 more rows

# Compute variables based on multiple others (including computed ones):
sws %>%
  mutate(BMI = mass / ((height / 100)  ^ 2),  # compute body mass index (kg/m^2)
         BMI_low  = BMI < 18.5,               # classify low BMI values
         BMI_high = BMI > 30,                 # classify high BMI values
         BMI_norm = !BMI_low & !BMI_high      # classify normal BMI values 
         )
#> # A tibble: 87 x 11
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows, and 4 more variables: BMI <dbl>, BMI_low <lgl>,
#> #   BMI_high <lgl>, BMI_norm <lgl>

## Note: See 
# ?dplyr::mutate  # for more help and examples. 

Note some details:

  • mutate computes new variables (columns) and adds them to existing ones, while transmute drops existing ones.

  • Each mutate command specifies a new variable name (without quotes), followed by = and a rule for computing the new variable from existing ones.

  • Variable names are unquoted.

  • Multiple mutate steps are separated by commas, each of which creates a new variable.

  • See http://r4ds.had.co.nz/transform.html#mutate-funs for useful functions for creating new variables.
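
The difference between mutate and transmute, and the fact that later variables can build on earlier ones within the same command, can be seen on a small example. The following is a minimal sketch that assumes the dplyr package is installed. The tibble toy and the variable name height_m are made up for illustration:

```r
library(dplyr)

# Hypothetical toy tibble (made up for illustration):
toy <- tibble(name   = c("A", "B"),
              height = c(172, 150),  # in cm
              mass   = c(77, 49))    # in kg

# mutate() adds new variables; later ones can use earlier ones:
added <- toy %>%
  mutate(height_m = height / 100,
         BMI = mass / height_m^2)
names(added)  # "name" "height" "mass" "height_m" "BMI"

# transmute() computes the same variables, but drops all others:
only_new <- toy %>%
  transmute(height_m = height / 100,
            BMI = mass / height_m^2)
names(only_new)  # "height_m" "BMI"
```

In both cases, the number of cases (rows) remains unchanged.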

Practice: Compute a new variable mass_pound from mass (in kg) and the age of each individual in sw relative to Yoda’s age. (Note that the variable birth_year is provided in years BBY, i.e., Before Battle of Yavin.)

5. summarise to compute summaries

summarise computes a function for a specified variable and collapses the values of this variable (i.e., the rows of a specified column) to a single value. It can compute many different summary statistics by itself, but is even more useful in combination with group_by (discussed next).

# Summarise allows computing a function for a variable (column): 
summarise(sw, mn_mass = mean(mass, na.rm = TRUE))  # => 97.31 kg 
#> # A tibble: 1 x 1
#>    mn_mass
#>      <dbl>
#> 1 97.31186

# The same using the pipe: 
sw %>%
  summarise(mn_mass = mean(mass, na.rm = TRUE))  # => 97.31 kg 
#> # A tibble: 1 x 1
#>    mn_mass
#>      <dbl>
#> 1 97.31186

# Multiple summarise steps allow applying 
# different functions for 1 dependent variable: 
sw %>%
  summarise(n_mass = sum(!is.na(mass)), 
            mn_mass = mean(mass, na.rm = TRUE),
            md_mass = median(mass, na.rm = TRUE),
            sd_mass = sd(mass, na.rm = TRUE),
            max_mass = max(mass, na.rm = TRUE),
            big_mass = any(mass > 1000)
            )
#> # A tibble: 1 x 6
#>   n_mass  mn_mass md_mass  sd_mass max_mass big_mass
#>    <int>    <dbl>   <dbl>    <dbl>    <dbl>    <lgl>
#> 1     59 97.31186      79 169.4572     1358     TRUE
            
# Multiple summarise steps also allow applying 
# different functions to different dependent variables: 
sw %>%
  summarise(# Descriptives of height:  
            n_height = sum(!is.na(height)), 
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE), 
            # Descriptives of mass:
            n_mass = sum(!is.na(mass)), 
            mn_mass = mean(mass, na.rm = TRUE),
            sd_mass = sd(mass, na.rm = TRUE),
            # Counts of character variables:
            n_names = n(), 
            n_species = n_distinct(species),
            n_worlds = n_distinct(homeworld)
            )
#> # A tibble: 1 x 9
#>   n_height mn_height sd_height n_mass  mn_mass  sd_mass n_names n_species
#>      <int>     <dbl>     <dbl>  <int>    <dbl>    <dbl>   <int>     <int>
#> 1       81   174.358  34.77043     59 97.31186 169.4572      87        38
#> # ... with 1 more variables: n_worlds <int>

## Note: See 
# ?dplyr::summarise  # for more help and examples. 

Note some details:

  • summarise collapses multiple values into one value and returns a new tibble with one row per group (or a single row for ungrouped data) and one column per summary computed.

  • Each summarise step specifies a new variable name (without quotes), followed by =, and a function for computing the new variable from existing ones.

  • Multiple summarise steps are separated by commas.

  • Variable names are unquoted.

  • See https://dplyr.tidyverse.org/reference/summarise.html for examples and useful functions in combination with summarise.
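
The role of na.rm = TRUE and the shape of the resulting tibble are easier to see when the values are few and known. The following is a minimal sketch that assumes the dplyr package is installed. The tibble toy and its values are made up for illustration:

```r
library(dplyr)

# Hypothetical toy tibble with 1 missing value (made up for illustration):
toy <- tibble(mass = c(10, 20, NA))

res <- toy %>%
  summarise(n        = n(),                       # number of rows: 3
            n_known  = sum(!is.na(mass)),         # non-missing values: 2
            mn_naive = mean(mass),                # NA, as 1 value is missing
            mn_mass  = mean(mass, na.rm = TRUE))  # 15

dim(res)  # 1 row (no groups) and 4 columns (summaries)
```

Without na.rm = TRUE, a single missing value turns the mean into NA, which is why most of the summaries above remove missing values explicitly.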

Practice: Apply all summary functions mentioned in ?dplyr::summarise to the sw dataset.

6. group_by to aggregate variables

Using group_by does not change the data itself, but changes the unit of aggregation for other commands, which is very useful in combination with mutate and summarise.

# Grouping does not change the data, but lists its groups: 
group_by(sws, species)  # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups:   species [38]
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows

# The same using the pipe: 
sws %>%
  group_by(species)  # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups:   species [38]
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows

# group_by is ineffective by itself, but very powerful 
# (a) in combination with `mutate` and 
# (b) in combination with `summarise`. 

# ad (a):
# In combination with mutate and an aggregation function, 
# group_by changes the unit of aggregation:

sws %>%
  mutate(mn_height_1 = mean(height, na.rm = TRUE)) %>%  # aggregates over ALL cases
  group_by(species) %>%
  mutate(mn_height_2 = mean(height, na.rm = TRUE)) %>%  # aggregates over current group (species)
  group_by(gender) %>%
  mutate(mn_height_3 = mean(height, na.rm = TRUE)) %>%  # aggregates over current group (gender)
  group_by(name) %>%
  mutate(mn_height_4 = mean(height, na.rm = TRUE))      # aggregates over current group (name)
#> # A tibble: 87 x 11
#> # Groups:   name [87]
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows, and 4 more variables: mn_height_1 <dbl>,
#> #   mn_height_2 <dbl>, mn_height_3 <dbl>, mn_height_4 <dbl>

# ad (b):
# group_by is particularly useful in combination 
# with summarise:

sws %>%
  group_by(homeworld) %>%
  summarise(count = n(),
            mn_height = mean(height, na.rm = TRUE),
            mn_mass = mean(mass, na.rm = TRUE)
            )
#> # A tibble: 49 x 4
#>         homeworld count mn_height mn_mass
#>             <chr> <int>     <dbl>   <dbl>
#>  1       Alderaan     3  176.3333    64.0
#>  2    Aleen Minor     1   79.0000    15.0
#>  3         Bespin     1  175.0000    79.0
#>  4     Bestine IV     1  180.0000   110.0
#>  5 Cato Neimoidia     1  191.0000    90.0
#>  6          Cerea     1  198.0000    82.0
#>  7       Champala     1  196.0000     NaN
#>  8      Chandrila     1  150.0000     NaN
#>  9   Concord Dawn     1  183.0000    79.0
#> 10       Corellia     2  175.0000    78.5
#> # ... with 39 more rows

# Note that this pipe returns a new tibble, 
# with 49 rows (= different levels of homeworld) and 
# - 1 column of the group variable (homeworld) and 
# - 3 columns of the 3 newly summarised variables.


# group_by used with multiple variables yields a tibble with 
# 1 group for each combination of variable levels that occurs in the data: 
sw %>%
  group_by(hair_color, eye_color)  # => 35 groups (combinations)
#> # A tibble: 87 x 13
#> # Groups:   hair_color, eye_color [35]
#>                  name height  mass    hair_color  skin_color eye_color
#>                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
#>  1     Luke Skywalker    172    77         blond        fair      blue
#>  2              C-3PO    167    75          <NA>        gold    yellow
#>  3              R2-D2     96    32          <NA> white, blue       red
#>  4        Darth Vader    202   136          none       white    yellow
#>  5        Leia Organa    150    49         brown       light     brown
#>  6          Owen Lars    178   120   brown, grey       light      blue
#>  7 Beru Whitesun lars    165    75         brown       light      blue
#>  8              R5-D4     97    32          <NA>  white, red       red
#>  9  Biggs Darklighter    183    84         black       light     brown
#> 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Counting the frequency of cases in groups:
sw %>%
  group_by(hair_color, eye_color) %>%
  count() %>%
  arrange(desc(n))  
#> # A tibble: 35 x 3
#> # Groups:   hair_color, eye_color [35]
#>    hair_color eye_color     n
#>         <chr>     <chr> <int>
#>  1      black     brown     9
#>  2      brown     brown     9
#>  3       none     black     9
#>  4      brown      blue     7
#>  5       none    orange     7
#>  6       none    yellow     6
#>  7      blond      blue     3
#>  8       none      blue     3
#>  9       none       red     3
#> 10      black      blue     2
#> # ... with 25 more rows

# The same using summarise:
sw %>%
  group_by(hair_color, eye_color) %>%
  summarise(n = n()) %>%
  arrange(desc(n))  
#> # A tibble: 35 x 3
#> # Groups:   hair_color [13]
#>    hair_color eye_color     n
#>         <chr>     <chr> <int>
#>  1      black     brown     9
#>  2      brown     brown     9
#>  3       none     black     9
#>  4      brown      blue     7
#>  5       none    orange     7
#>  6       none    yellow     6
#>  7      blond      blue     3
#>  8       none      blue     3
#>  9       none       red     3
#> 10      black      blue     2
#> # ... with 25 more rows

## Note: See 
# ?dplyr::group_by  # for more help and examples. 

Note some details:

  • group_by changes the unit of aggregation for other commands (mutate and summarise).

  • Variable names are unquoted.

  • When using group_by with multiple variables, they are separated by commas.

  • Using group_by with mutate results in a tibble that has the same number of cases (rows) as the original tibble. By contrast, using group_by with summarise results in a new tibble with all existing combinations of variable levels as its cases (rows).
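
The contrast between grouped mutate and grouped summarise can be verified on a small example. The following is a minimal sketch that assumes the dplyr package is installed. The tibble toy and its variables g, h, and v are made up for illustration:

```r
library(dplyr)

# Hypothetical toy tibble (made up for illustration):
toy <- tibble(g = c("a", "a", "b"),
              h = c("x", "y", "x"),
              v = c(1, 3, 10))

# Grouped mutate: same number of rows, group means repeated within groups:
gm <- toy %>%
  group_by(g) %>%
  mutate(mn_v = mean(v)) %>%
  ungroup()
gm$mn_v  # 2 2 10

# Grouped summarise: 1 row per group:
gs <- toy %>%
  group_by(g) %>%
  summarise(mn_v = mean(v))
nrow(gs)  # 2

# Grouping by 2 variables: 1 row per combination that occurs in the data
# (here: 3 of the 4 possible combinations of g and h):
gc <- toy %>%
  group_by(g, h) %>%
  summarise(n = n())  # recent dplyr versions may print a message on grouping
nrow(gc)        # 3
group_vars(gc)  # "g" (summarise removes the last grouping variable)
```

As a result of a grouped summarise may still be grouped, adding ungroup() at the end of a pipe is a safe habit when later steps should apply to all cases.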

Practice: Create some groups and compute descriptive statistics (n, mean, median, standard deviation, …) for some variables. For instance,

  • What is the number and mean height and mass of individuals from Tatooine by species and gender?

  • Which humans are more than 5cm taller than the average human overall?

  • Which humans are more than 5cm taller than the average human of their own gender?

Combining commands

The essential dplyr commands are quite simple by themselves, but form the basic verbs of a language for data manipulation. The commands become particularly powerful when they are combined into pipes (by using %>%). Stringing together several dplyr commands allows slicing and dicing data (tibbles or data frames) in a step-wise fashion to run non-trivial data analyses on the fly.
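
A pipe that strings together most of the essential commands could look as follows. This is a minimal sketch, not a solution to any particular task. It assumes that the dplyr package is installed and that dplyr::starwars contains the variables height and eye_color (as in the outputs shown above); the names height_m, mn_height_m, and res are made up for illustration:

```r
library(dplyr)

res <- dplyr::starwars %>%
  filter(!is.na(height)) %>%            # keep cases with known height
  mutate(height_m = height / 100) %>%   # convert height from cm to m
  group_by(eye_color) %>%               # change the unit of aggregation
  summarise(n = n(),                    # count cases per eye color
            mn_height_m = mean(height_m)) %>%  # mean height per eye color
  arrange(desc(n))                      # most frequent eye colors first

res
```

Each step receives the tibble returned by the previous step, so that the pipe can be read from top to bottom like a sequence of instructions.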

Practice: Tidyverse meets universe

Answer the following questions about the dplyr::starwars dataset by using pipes of essential dplyr commands:

a. Basics:

  • Save the tibble dplyr::starwars as sw and report its dimensions.

b. Missing values and known unknowns:

  • How many missing (NA) values does sw contain?

  • Which individuals come from an unknown (missing) homeworld but have a known birth_year or known mass?

c. Gender issues:

  • How many humans are contained in sw overall and by gender?

  • How many and which individuals in sw are neither male nor female?

  • For which species in sw do at least 2 different gender values exist?

d. Popular homes and heights:

  • From which homeworld do the most individuals (rows) come?

  • What is the mean height of all individuals with orange eyes from the most popular homeworld?

e. Size and mass issues:

  • Compute the median, mean, and standard deviation of height for all droids.

  • Compute the mean height and median mass by species and save the result as h_m.

  • Sort h_m to list the 3 species with the smallest individuals (in terms of mean height).

  • Sort h_m to list the 3 species with the heaviest individuals (in terms of median mass).

f. Counting and arranging:

  • How many individuals exist of the three most frequent (known) species?

g. Grouped mutates:

  • Which individuals are more than 20% lighter than the average mass of individuals of their own homeworld?
# library(tidyverse)
# ?dplyr::starwars

## (a) Basic data properties: ---- 
sw <- dplyr::starwars
dim(sw)  # => 87 rows (denoting individuals) x 13 columns (variables) 
#> [1] 87 13

## (b) Missing data: ----- 

## (+) How many missing data points?
sum(is.na(sw))  # => 101 missing values.
#> [1] 101

# (+) Which individuals come from an unknown (missing) homeworld 
#     but have a known birth_year or mass? 
sw %>% 
  filter(is.na(homeworld), !is.na(mass) | !is.na(birth_year))
#> # A tibble: 3 x 13
#>           name height  mass hair_color skin_color eye_color birth_year
#>          <chr>  <int> <dbl>      <chr>      <chr>     <chr>      <dbl>
#> 1         Yoda     66    17      white      green     brown        896
#> 2        IG-88    200   140       none      metal       red         15
#> 3 Qui-Gon Jinn    193    89      brown       fair      blue         92
#> # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>


## (x) Which variable (column) has the most missing values?
colSums(is.na(sw))  # => birth_year has 44 missing values
#>       name     height       mass hair_color skin_color  eye_color 
#>          0          6         28          5          0          0 
#> birth_year     gender  homeworld    species      films   vehicles 
#>         44          3         10          5          0          0 
#>  starships 
#>          0
colMeans(is.na(sw)) #    (amounting to 50.6% of all cases). 
#>       name     height       mass hair_color skin_color  eye_color 
#> 0.00000000 0.06896552 0.32183908 0.05747126 0.00000000 0.00000000 
#> birth_year     gender  homeworld    species      films   vehicles 
#> 0.50574713 0.03448276 0.11494253 0.05747126 0.00000000 0.00000000 
#>  starships 
#> 0.00000000

## (x) Replace all missing values of `hair_color` (in the variable `sw$hair_color`) by "bald": 
# sw$hair_color[is.na(sw$hair_color)] <- "bald"


## (c) Gender issues: ----- 

# (+) How many humans are there of each gender?
sw %>% 
  filter(species == "Human") %>%
  group_by(gender) %>%
  count()
#> # A tibble: 2 x 2
#> # Groups:   gender [2]
#>   gender     n
#>    <chr> <int>
#> 1 female     9
#> 2   male    26

## Answer: 35 humans in total: 9 female, 26 male.

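# (+) How many and which individuals are neither male nor female?
# Note: filter() drops rows for which a condition evaluates to NA, 
#       so the 3 individuals with a missing gender are not listed here. 
sw %>% 
  filter(gender != "male", gender != "female")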
#> # A tibble: 3 x 13
#>                    name height  mass hair_color       skin_color eye_color
#>                   <chr>  <int> <dbl>      <chr>            <chr>     <chr>
#> 1 Jabba Desilijic Tiure    175  1358       <NA> green-tan, brown    orange
#> 2                 IG-88    200   140       none            metal       red
#> 3                   BB8     NA    NA       none             none     black
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> #   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> #   starships <list>

# (+) For which species are there at least 2 different gender values?
sw %>%
  group_by(species, gender) %>%
  count() %>%  # table shows species by gender: 
  group_by(species) %>%  # Which species appear more than once in this table? 
  count() %>%
  filter(nn > 1)
#> # A tibble: 5 x 2
#> # Groups:   species [5]
#>    species    nn
#>      <chr> <int>
#> 1    Droid     2
#> 2    Human     2
#> 3 Kaminoan     2
#> 4  Twi'lek     2
#> 5     <NA>     2

## (d) Homeworld issues: ----- 

# (+) Popular homes: Which homeworld do the most individuals (rows) come from? 
sw %>%
  group_by(homeworld) %>%
  count() %>%
  arrange(desc(n))
#> # A tibble: 49 x 2
#> # Groups:   homeworld [49]
#>    homeworld     n
#>        <chr> <int>
#>  1     Naboo    11
#>  2  Tatooine    10
#>  3      <NA>    10
#>  4  Alderaan     3
#>  5 Coruscant     3
#>  6    Kamino     3
#>  7  Corellia     2
#>  8  Kashyyyk     2
#>  9    Mirial     2
#> 10    Ryloth     2
#> # ... with 39 more rows
# => Naboo (with 11 individuals)

# (+) What is the mean height of all individuals with orange eyes from the most popular homeworld? 
sw %>% 
  filter(homeworld == "Naboo", eye_color == "orange") %>%
  summarise(n = n(),
            mn_height = mean(height))
#> # A tibble: 1 x 2
#>       n mn_height
#>   <int>     <dbl>
#> 1     3  208.6667

## Note: 
sw %>% filter(eye_color == "orange") # => 8 individuals
#> # A tibble: 8 x 13
#>                    name height  mass hair_color          skin_color
#>                   <chr>  <int> <dbl>      <chr>               <chr>
#> 1 Jabba Desilijic Tiure    175  1358       <NA>    green-tan, brown
#> 2                Ackbar    180    83       none        brown mottle
#> 3         Jar Jar Binks    196    66       none              orange
#> 4          Roos Tarpals    224    82       none                grey
#> 5            Rugor Nass    206    NA       none               green
#> 6               Sebulba    112    40       none           grey, red
#> 7        Ben Quadinaros    163    65       none grey, green, yellow
#> 8           Saesee Tiin    188    NA       none                pale
#> # ... with 8 more variables: eye_color <chr>, birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>


# (+) What is the mass and homeworld of the smallest droid?
sw %>% 
  filter(species == "Droid") %>%
  arrange(height)
#> # A tibble: 5 x 13
#>    name height  mass hair_color  skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl>      <chr>       <chr>     <chr>      <dbl>  <chr>
#> 1 R2-D2     96    32       <NA> white, blue       red         33   <NA>
#> 2 R5-D4     97    32       <NA>  white, red       red         NA   <NA>
#> 3 C-3PO    167    75       <NA>        gold    yellow        112   <NA>
#> 4 IG-88    200   140       none       metal       red         15   none
#> 5   BB8     NA    NA       none        none     black         NA   none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

## (e) Size and mass: Group summaries: ----- 

# (+) Compute the median, mean, and standard deviation of `height` for all droids.
sw %>%
  filter(species == "Droid") %>%
  summarise(n = n(),
            not_NA_h = sum(!is.na(height)),
            md_height = median(height, na.rm = TRUE),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 x 5
#>       n not_NA_h md_height mn_height sd_height
#>   <int>    <int>     <dbl>     <dbl>     <dbl>
#> 1     5        4       132       140  52.00641

# (+) Compute the mean height and median mass by species and save the result as `h_m`:
h_m <- sw %>%
  group_by(species) %>%
  summarise(n = n(),
            not_NA_h = sum(!is.na(height)),
            mn_height = mean(height, na.rm = TRUE),
            not_NA_m = sum(!is.na(mass)),
            md_mass = median(mass, na.rm = TRUE)
            )
h_m
#> # A tibble: 38 x 6
#>      species     n not_NA_h mn_height not_NA_m md_mass
#>        <chr> <int>    <int>     <dbl>    <int>   <dbl>
#>  1    Aleena     1        1   79.0000        1    15.0
#>  2  Besalisk     1        1  198.0000        1   102.0
#>  3    Cerean     1        1  198.0000        1    82.0
#>  4  Chagrian     1        1  196.0000        0      NA
#>  5  Clawdite     1        1  168.0000        1    55.0
#>  6     Droid     5        4  140.0000        4    53.5
#>  7       Dug     1        1  112.0000        1    40.0
#>  8      Ewok     1        1   88.0000        1    20.0
#>  9 Geonosian     1        1  183.0000        1    80.0
#> 10    Gungan     3        3  208.6667        2    74.0
#> # ... with 28 more rows

# (+) Use `h_m` to list the 3 species with the smallest individuals (in terms of mean height):
h_m %>% arrange(mn_height) %>% slice(1:3)
#> # A tibble: 3 x 6
#>          species     n not_NA_h mn_height not_NA_m md_mass
#>            <chr> <int>    <int>     <dbl>    <int>   <dbl>
#> 1 Yoda's species     1        1        66        1      17
#> 2         Aleena     1        1        79        1      15
#> 3           Ewok     1        1        88        1      20

# (+) Use `h_m` to list the 3 species with the heaviest individuals (in terms of median mass):
h_m %>% arrange(desc(md_mass)) %>%  slice(1:3)
#> # A tibble: 3 x 6
#>   species     n not_NA_h mn_height not_NA_m md_mass
#>     <chr> <int>    <int>     <dbl>    <int>   <dbl>
#> 1    Hutt     1        1       175        1    1358
#> 2 Kaleesh     1        1       216        1     159
#> 3 Wookiee     2        2       231        2     124


## (+) Other questions: ----- 

# (f) How many individuals come from the 3 most frequent (known) species?
sw %>%
  group_by(species) %>%
  count() %>%
  arrange(desc(n)) %>%
  filter(n > 1)
#> # A tibble: 9 x 2
#> # Groups:   species [9]
#>    species     n
#>      <chr> <int>
#> 1    Human    35
#> 2    Droid     5
#> 3     <NA>     5
#> 4   Gungan     3
#> 5 Kaminoan     2
#> 6 Mirialan     2
#> 7  Twi'lek     2
#> 8  Wookiee     2
#> 9   Zabrak     2
# => Human (35), Droid (5), and Gungan (3): 43 individuals in total.

# (g) Which individuals are more than 20% lighter (in terms of mass) 
#     than the average mass of individuals of their own homeworld?
sw %>%
  select(name, homeworld, mass) %>%
  group_by(homeworld) %>%
  mutate(n_notNA_mass = sum(!is.na(mass)),  
         mn_mass = mean(mass, na.rm = TRUE),
         lighter = mass < (mn_mass - (.20 * mn_mass))
         ) %>%
  filter(lighter == TRUE)
#> # A tibble: 5 x 6
#> # Groups:   homeworld [4]
#>            name homeworld  mass n_notNA_mass  mn_mass lighter
#>           <chr>     <chr> <dbl>        <int>    <dbl>   <lgl>
#> 1         R2-D2     Naboo    32            6 64.16667    TRUE
#> 2   Leia Organa  Alderaan    49            2 64.00000    TRUE
#> 3         R5-D4  Tatooine    32            8 85.37500    TRUE
#> 4          Yoda      <NA>    17            3 82.00000    TRUE
#> 5 Padmé Amidala     Naboo    45            6 64.16667    TRUE

More on data transformation

For more details on dplyr,

Visualizing data

In the following, we introduce some essential commands of ggplot2 in the context of examples. However, the ggplot2 package extends far beyond this modest introduction – it is an important pillar (and predecessor) of the tidyverse and implements a language for and philosophy of data visualisation.

See Chapter 3: Data visualization and Chapter 7: Exploratory data analysis (EDA) and the links provided below for more detailed information.

Commands and examples

General structure of ggplot calls

A generic template for creating a graph with ggplot is:

# Generic ggplot template: 
ggplot(data = <DATA>) + 
  <GEOM_fun>(mapping = aes(<MAPPING>), <arg_1 = val_1, ..., arg_n = val_n>) +
  <FACET_fun> +    # optional
  <LOOK_GOOD_fun>  # optional 
  
# Minimal ggplot template:
ggplot(<DATA>) + 
  <GEOM_fun>(aes(<MAPPING>))

The generic template includes the following parts:

  • <DATA> is a data frame or tibble that contains the data that is to be plotted.

  • <GEOM_fun> is a function that maps data to a geometric object (“geom”) according to an aesthetic mapping that is specified in aes(<MAPPING>). (A “mapping” specifies what goes where.)

  • A geom’s visual appearance (e.g., colors, shapes, sizes, …) can be customized
    1. in the aesthetic mapping (when varying visual features according to data properties), or
    2. by setting its arguments to specific values in <arg_1 = val_1, ..., arg_n = val_n> (when remaining constant).
  • An optional <FACET_fun> splits a complex plot into multiple subplots.

  • A sequence of optional <LOOK_GOOD_fun> adjusts the visual features of plots (e.g., by adding themes, plot titles and labels, color scales, and coordinate systems).
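
To make the template concrete, here is one way of filling in all of its parts, using the mpg data of ggplot2. The particular choice of variables, geom, facet variable, and theme is only an illustration:

```r
library(ggplot2)

p <- ggplot(data = mpg) +                          # <DATA>
  geom_point(mapping = aes(x = displ, y = hwy),    # <GEOM_fun> with <MAPPING>
             color = "steelblue", alpha = 1/2) +   # constant arguments
  facet_wrap(~drv) +                               # <FACET_fun> (optional)
  labs(title = "Engine displacement vs. highway mileage",
       x = "Displacement (litres)",
       y = "Highway (miles per gallon)") +         # <LOOK_GOOD_fun> (optional)
  theme_bw()                                       # <LOOK_GOOD_fun> (optional)
p
```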

Some examples that illustrate the use of these components are:

A histogram

A histogram counts how often specific values of one (typically continuous) variable occur in the data. This allows viewing the distribution of values for this variable:

library(ggplot2)



# Data: ------ 
# Using mpg data:
# ?ggplot2::mpg  # provides background information on the data set.
mpg
#> # A tibble: 234 x 11
#>    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
#>    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
#>  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p    
#>  2 audi         a4        1.80  1999     4 manual… f        21    29 p    
#>  3 audi         a4        2.00  2008     4 manual… f        20    31 p    
#>  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p    
#>  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p    
#>  6 audi         a4        2.80  1999     6 manual… f        18    26 p    
#>  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p    
#>  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p    
#>  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p    
#> 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p    
#> # ... with 224 more rows, and 1 more variable: class <chr>

# (A) Histogram: ------

# A minimal histogram:
hi1 <- ggplot(mpg, aes(x = cty)) +  # set mappings for ALL geoms
  geom_histogram(binwidth = 1) 
hi1


# The same histogram:
hi1b <- ggplot(mpg) +
  geom_histogram(aes(x = cty), binwidth = 1)  # set mappings for THIS geom
hi1b


# (B) Adding aesthetics, labels and themes: ------ 

# Enhanced version of the same plot: 
hi2 <- ggplot(mpg) +
  geom_histogram(aes(x = cty), binwidth = 1, fill = "forestgreen", color = "black") +
  labs(title = "Distribution of fuel economy in city environments", 
       x = "cty (miles per gallon)",
       caption = "Data from ggplot2::mpg") +
  theme_light()
hi2

A scatterplot

A scatterplot shows each data point (observation) at a position determined by 2 (typically continuous) variables x and y. This allows judging the relationship between x and y in the data:


# (A) Scatterplot: ------ 

# A minimal scatterplot + reference line:
sp1 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline()
sp1

Dealing with overplotting

A common issue with scatterplots is so-called overplotting: Multiple points appear at the same position.

Here are some ways of dealing with this issue:

  1. jitter adds randomness to positions;
  2. alpha uses transparency to show frequency of positions;
  3. geom_count (or mapping a variable to the size aesthetic) allows mapping values (e.g., frequency) to object size;
  4. facet_wrap allows disentangling plots by levels of variables.
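
Mapping the frequency of positions to object size can be sketched with geom_count of ggplot2, which counts the observations at each position and maps this count to point size. The object name sp_n and the colors are chosen here for illustration:

```r
library(ggplot2)

# geom_count() counts the observations at each (x, y) position
# and maps this count to the size of the point drawn there:
sp_n <- ggplot(mpg) +
  geom_count(aes(x = cty, y = hwy), color = "steelblue", alpha = 1/2) +
  geom_abline(linetype = 2)
sp_n
```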

Some examples include:

## Dealing with overplotting: ----- 

# 1. One way of dealing with overplotting is 
#    adding randomness to point positions:  
sp2 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), position = "jitter") +
  geom_abline()
sp2


# 2. Another way of dealing with overplotting is 
#    using transparency (via setting alpha to < 1): 
sp3 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), position = "identity", 
             pch = 21, fill = "steelblue", alpha = 1/4, size = 4) +
  geom_abline(linetype = 2, color = "firebrick") # + 
  # geom_rug(aes(x = cty, y = hwy), position = "jitter", alpha = 1/4, size = 1)
sp3


# Adding labels and themes to plots: 
sp4 <- sp3 +   # use the plot defined above
  labs(title = "Fuel economy on highway vs. city",
       x = "City (miles per gallon)",
       y = "Highway (miles per gallon)",
       caption = "Data from ggplot2::mpg") +
  # coord_fixed() +
  theme_bw()
sp4


# (C) Grouping (by a categorical variable): ------  

# Using facets to avoid overplotting: 
sp5 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline() + 
  facet_wrap(~class) +
  theme_bw()
sp5


# Grouping by color:
sp6 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy, color = class), 
             position = "jitter", alpha = 1/2, size = 4) +
  geom_abline(linetype = 2) +
  theme_bw()
sp6


# Grouping by facets: 
sp7 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), 
             position = "jitter", alpha = 1/2, size = 2) +
  geom_abline(linetype = 2) +
  facet_wrap(~class) +
  theme_bw()
sp7

See https://ggplot2.tidyverse.org/reference/ for more examples.

Note some details:

  • ggplot requires data and maps independent variables to dimensions (e.g., the x- and y-axis) and dependent variables to geometric objects (called “geoms”). It typically assumes that the to-be-plotted <DATA> is in a table (data frame or tibble) in long format and contains independent variables as factors.

  • The argument names data = and mapping = can be omitted, but an aesthetic mapping aes(<MAPPING>) for at least one geom is needed.

  • Different geoms can be combined, but their order matters (as later layers are printed on top of earlier ones).

  • When multiple geoms use the same mappings, their common aes(<MAPPING>) can be moved into the initial ggplot call (behind <DATA>).

  • In ggplot, a sequence of commands is combined by +, rather than %>%.

  • The visual appearance of plots is highly customizable (e.g., by supplying aesthetic arguments, specifying labels and legends, and applying pre-defined themes to plots).
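
The points on combining geoms and sharing mappings can be sketched as follows. The example assumes the mpg data and uses geom_smooth as a second layer; p1 and p2 are illustrative names:

```r
library(ggplot2)

# Same mappings repeated in each geom:
p1 <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  geom_smooth(aes(x = displ, y = hwy))

# Common mappings moved into the initial ggplot call:
p2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +   # drawn first (bottom layer)
  geom_smooth()    # drawn on top of the points
p2
```

Both objects describe the same plot. Swapping the order of geom_point and geom_smooth would draw the points on top of the smoothed line.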

EDA

Creating good graphs is both an art and a craft. The key to creating good graphs lies in answering 2 sets of questions:

  1. Knowing the number and type of variables to be plotted. This includes answering data-related questions like

    • How many variables are there to plot?
    • Are these variables categorical or continuous?
    • Do some variables qualify (e.g., group) the values of others?
  2. Knowing the intended type of plot. This includes answering functional questions like

    • What is the purpose of this plot?
    • What are possible plots for this purpose?
    • Which of these would be the most appropriate plot?

Even when the questions of 1. and 2. are answered, creating good graphs with ggplot requires a lot of practice and many hours of trial-and-error experimentation.

Basic plot types

Histograms

A histogram shows counts of the values of 1 (typically continuous) variable. This is useful for evaluating the distribution of the variable:

library(tidyverse)  # loads ggplot2 and tibble (needed for tibble() below)
 
# Create data: 
tb <- tibble(iq = rnorm(n = 1000, mean = 100, sd = 15))
 
# Basic histogram:
ggplot(tb) + 
  geom_histogram(aes(x = iq), binwidth = 5)


# Pimped histogram: 
ggplot(tb) + 
  geom_histogram(aes(x = iq), binwidth = 5, 
                 fill = "gold", color = "black") +
  labs(title = "Histogram", x = "IQ values", y = "Frequency in sample (n)",
       caption = "[Using random iq data.]") +
  theme_classic()

More on histograms:

Scatterplots

A scatterplot shows the relationship between 2 (typically continuous) variables:

# Data:
ir <- as_tibble(iris)
ir
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1         5.10        3.50         1.40       0.200 setosa 
#>  2         4.90        3.00         1.40       0.200 setosa 
#>  3         4.70        3.20         1.30       0.200 setosa 
#>  4         4.60        3.10         1.50       0.200 setosa 
#>  5         5.00        3.60         1.40       0.200 setosa 
#>  6         5.40        3.90         1.70       0.400 setosa 
#>  7         4.60        3.40         1.40       0.300 setosa 
#>  8         5.00        3.40         1.50       0.200 setosa 
#>  9         4.40        2.90         1.40       0.200 setosa 
#> 10         4.90        3.10         1.50       0.100 setosa 
#> # ... with 140 more rows

# Basic scatterplot:
ggplot(ir) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species))


# Using 3 different facets:
ggplot(ir) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  facet_wrap(~Species)


# Pimped scatterplot:
ggplot(ir) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, fill = Species), 
             pch = 21, color = "black", size = 2, alpha = 1/2) +
  facet_wrap(~Species) +
  # coord_fixed() + 
  labs(title = "Scatterplot", x = "Length of petal", y = "Width of petal",
       caption = "[Using iris data.]") + 
  theme_bw() +
  theme(legend.position = "none")

More on scatterplots:

Bar plots

Another common type of plot shows values (across different levels of some variable) as the height of bars. As this plot type can use both categorical and continuous variables, it turns out to be surprisingly complex to create good bar charts. To get us started, here are only a few examples:

Counts of cases

By default, geom_bar computes summary statistics of the data. When nothing else is specified, geom_bar counts the number or frequency of values (i.e., stat = "count") and maps this count to the y-axis (i.e., y = ..count..):

library(ggplot2)

## Data: 
ggplot2::mpg
#> # A tibble: 234 x 11
#>    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
#>    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
#>  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p    
#>  2 audi         a4        1.80  1999     4 manual… f        21    29 p    
#>  3 audi         a4        2.00  2008     4 manual… f        20    31 p    
#>  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p    
#>  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p    
#>  6 audi         a4        2.80  1999     6 manual… f        18    26 p    
#>  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p    
#>  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p    
#>  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p    
#> 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p    
#> # ... with 224 more rows, and 1 more variable: class <chr>

# (a) Count number of cases by class: 
ggplot(mpg) + 
  geom_bar(aes(x = class))


# (b) is the same as: 
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..count..))


# (c) is the same as:
ggplot(mpg) + 
  geom_bar(aes(x = class), stat = "count")


# (d) is the same as:
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..count..), stat = "count")


# (e) pimped version:
ggplot(mpg) + 
  geom_bar(aes(x = class, fill = class), 
           # stat = "count", 
           color = "black") + 
  labs(title = "Counts of cars by class",
       x = "Class of car", y = "Frequency") + 
  scale_fill_brewer(name = "Class:", palette = "Blues") + 
  theme_bw()

Practice: Plot the number or frequency of cases in the mpg data by cyl (in at least 3 different ways).

Proportion of cases

An alternative to showing the count or frequency of cases is showing the corresponding proportion of cases:

library(ggplot2)

## Data: 
ggplot2::mpg
#> # A tibble: 234 x 11
#>    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
#>    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
#>  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p    
#>  2 audi         a4        1.80  1999     4 manual… f        21    29 p    
#>  3 audi         a4        2.00  2008     4 manual… f        20    31 p    
#>  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p    
#>  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p    
#>  6 audi         a4        2.80  1999     6 manual… f        18    26 p    
#>  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p    
#>  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p    
#>  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p    
#> 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p    
#> # ... with 224 more rows, and 1 more variable: class <chr>

# (a) Proportion of cases by class: 
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..prop.., group = 1))


# (b) is the same as: 
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..count../sum(..count..)))
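
In more recent versions of ggplot2 (to our knowledge, 3.3.0 and later), computed variables can also be accessed with after_stat(), and the ..count.. notation is reported as deprecated from version 3.4.0 on. A sketch of the equivalent calls, assuming such a version is installed (bp_n and bp_p are illustrative names):

```r
library(ggplot2)

# Counts of cases by class:
bp_n <- ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(count)))
bp_n

# Proportions of cases by class:
bp_p <- ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(prop), group = 1))
bp_p
```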

Practice: Plot the proportion of cases in the mpg data by cyl (in at least 3 different ways).

Bar plots of existing values

A common difficulty occurs when the table to plot already contains the values to be shown as bars. As there is nothing to be computed in this case, we need to specify stat = "identity" for geom_bar (to override its default of stat = "count").

For instance, let’s plot a bar chart that shows the election data from the following tibble de:

year party share
2013 CDU/CSU 0.415
2013 SPD 0.257
2013 Others 0.328
2017 CDU/CSU 0.330
2017 SPD 0.205
2017 Others 0.465
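
The code that creates de is not shown here. One way of entering it is tribble, as in the following sketch. It assumes that year is stored as character and party as a factor (as suggested by the tibble printed below), and the order of the factor levels is a guess. The plot at the end uses geom_col, which ggplot2 provides as a shortcut for geom_bar(stat = "identity"):

```r
library(tibble)
library(ggplot2)

# Enter the table row-by-row (values copied from the table above):
de <- tribble(
  ~year,  ~party,    ~share,
  "2013", "CDU/CSU", 0.415,
  "2013", "SPD",     0.257,
  "2013", "Others",  0.328,
  "2017", "CDU/CSU", 0.330,
  "2017", "SPD",     0.205,
  "2017", "Others",  0.465
)

# Assumption: party is a factor with its levels in this order
de$party <- factor(de$party, levels = c("CDU/CSU", "SPD", "Others"))

# geom_col() plots the values of share as they are:
bp_col <- ggplot(de, aes(x = year, y = share, fill = party)) +
  geom_col(position = "dodge")
bp_col
```
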
  1. A version with 2 x 3 separate bars (using position = "dodge"):
## Data: ----- 
de  # => 6 x 3 tibble
#> # A tibble: 6 x 3
#>   year  party   share
#> * <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.330
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

## Note that year is of type character, which could be changed by:
# de$year <- parse_integer(de$year)

## (1) Bar chart with  side-by-side bars (dodge): ----- 

## (a) minimal version: 
bp_1 <- ggplot(de, aes(x = year, y = share, fill = party)) +
  ## (A) 3 bars per election (position = "dodge"):  
  geom_bar(stat = "identity", position = "dodge", color = "black") # 3 bars next to each other
bp_1


## (b) Version with text labels and customized colors: 
bp_1 + 
  ## pimping plot: 
  geom_text(aes(label = paste0(round(share * 100, 1), "%"), y = share + .01), 
            position = position_dodge(width = 1), 
            fontface = 2, color = "black") + 
  # Some set of high contrast colors: 
  scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) + 
  # Titles and labels: 
  labs(title = "Partial results of the German general elections 2013 and 2017", 
       x = "Year of election", y = "Share of votes", 
       caption = "Data from www.bundeswahlleiter.de.") + 
  # coord_flip() + 
  theme_bw()

  2. A version with 2 bars with 3 segments (using position = "stack"):
## Data: ----- 
de  # => 6 x 3 tibble
#> # A tibble: 6 x 3
#>   year  party   share
#> * <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.330
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

## (2) Bar chart with stacked bars: -----  

## (a) minimal version: 
bp_2 <- ggplot(de, aes(x = year, y = share, fill = party)) +
  ## (B) 1 bar per election (position = "stack"):
  geom_bar(stat = "identity", position = "stack") # 1 bar per election
bp_2


## (b) Version with text labels and customized colors: 
bp_2 +   
  ## Pimping plot: 
  geom_text(aes(label = paste0(round(share * 100, 1), "%")), 
            position = position_stack(vjust = .5),
            color = rep(c("black", "white", "white"), 2), 
            fontface = 2) + 
  # Some set of high contrast colors: 
  scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) + 
  # Titles and labels: 
  labs(title = "Partial results of the German general elections 2013 and 2017", 
       x = "Year of election", y = "Share of votes", 
       caption = "Data from www.bundeswahlleiter.de.") + 
  # coord_flip() + 
  theme_classic()

Bar plots with error bars

It is typically a good idea to add some measure of variability (e.g., the standard deviation, standard error, or confidence interval) to bar plots. There is an entire range of geoms that draw error bars:

## Create data to plot: ----- 
n_cat <- 6
set.seed(101)

data <- tibble(
  name = LETTERS[1:n_cat],
  value = sample(seq(25, 50), n_cat),
  sd = rnorm(n = n_cat, mean = 0, sd = 8))  # random values (can be negative, unlike a real SD)
data
#> # A tibble: 6 x 3
#>   name  value     sd
#>   <chr> <int>  <dbl>
#> 1 A        34  1.71 
#> 2 B        26  2.49 
#> 3 C        42  9.39 
#> 4 D        40  4.95 
#> 5 E        30 -0.902
#> 6 F        31  7.34

## Error bars: -----

## x-aesthetic only:

# (a) errorbar: 
ggplot(data) +
    geom_bar(aes(x = name, y = value), stat = "identity", fill = "steelblue") +
    geom_errorbar(aes(x = name, ymin = value - sd, ymax = value + sd), 
                  width = 0.4, color = "orange", alpha = 1, size = 1.0)


# (b) linerange: 
ggplot(data) +
    geom_bar(aes(x = name, y = value), stat = "identity", fill = "olivedrab3") +
    geom_linerange(aes(x = name, ymin = value - sd, ymax = value + sd), 
                   color = "firebrick", alpha = 1, size = 2.5)


## Additional y-aesthetic: 

# (c) crossbar:
ggplot(data) +
    geom_bar(aes(x = name, y = value), stat = "identity", fill = "tomato4") +
    geom_crossbar(aes(x = name, y = value, ymin = value - sd, ymax = value + sd), 
                  width = 0.3, color = "sienna1", alpha = 1, size = 1.0)


# (d) pointrange: 
ggplot(data) +
    geom_bar(aes(x = name, y = value), stat = "identity", fill = "burlywood4") +
    geom_pointrange(aes(x = name, y = value, ymin = value - sd, ymax = value + sd), 
                    color = "gold", alpha = 1.0, size = 1.2)
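
The example above uses randomly generated values for sd. With real data, the values to plot are typically computed first. A sketch that combines summarise with geom_errorbar for the mpg data (the names hwy_by_class, mn_hwy, and sd_hwy are chosen here for illustration):

```r
library(dplyr)
library(ggplot2)

# Summarise hwy by class (n, mean, and SD per group):
hwy_by_class <- mpg %>%
  group_by(class) %>%
  summarise(n = n(),
            mn_hwy = mean(hwy),
            sd_hwy = sd(hwy))

# Bars show the means, error bars show mean +/- 1 SD:
ggplot(hwy_by_class, aes(x = class)) +
  geom_bar(aes(y = mn_hwy), stat = "identity", fill = "steelblue") +
  geom_errorbar(aes(ymin = mn_hwy - sd_hwy, ymax = mn_hwy + sd_hwy),
                width = 0.4)
```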

More on barplots:

Drawing curves and lines

  • adding trendlines
  • lines of data (e.g., means)
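
As a minimal sketch of both points (not a full treatment), geom_smooth can add a trendline to a scatterplot, and geom_line can connect summary values such as group means. The data and variable choices below are only illustrations:

```r
library(dplyr)
library(ggplot2)

# (a) Scatterplot with a linear trendline (and its confidence band):
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 1/2) +
  geom_smooth(method = "lm")

# (b) Line of means: mean hwy by number of cylinders
mn_by_cyl <- mpg %>%
  group_by(cyl) %>%
  summarise(mn_hwy = mean(hwy))

ggplot(mn_by_cyl, aes(x = cyl, y = mn_hwy)) +
  geom_line() +
  geom_point(size = 3)
```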

Box plots

  • show medians, quartiles, distribution, and outliers
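
A minimal sketch of a box plot, using geom_boxplot on the mpg data (the choice of variables and colors is only an illustration). The box spans the first to the third quartile, the line inside the box marks the median, and points beyond the whiskers are drawn as outliers:

```r
library(ggplot2)

# Distribution of hwy for each class of car:
bx <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(fill = "gold") +
  labs(x = "Class of car", y = "Highway (miles per gallon)") +
  theme_bw()
bx
```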

Improving plots

Most default plots can be improved by fine-tuning their visual appearance. Popular levers for “pimping” plots include:

  • colors: can be set within geoms (variable when inside aes(...), fixed outside), choosing or designing specific color scales;
  • labels: labs(...) allows setting titles, captions, axis labels, etc.;
  • legends: can be (re-)moved or edited;
  • themes: can be selected or modified.

More on data visualization

Conclusion

All ds4psy essentials:

Nr. Topic
1. Creating and using tibbles
2. Data transformation
3. Visualizing data

[Last update on 2018-07-06 18:09:02 by hn.]


  1. This is different in Sankey diagrams, shown at https://developers.google.com/chart/interactive/docs/gallery/sankey.